1. Introduction

This paper discusses structural risk minimization in the setting of classification
with a reject option. Binary classification is about classifying observations that
take values in an arbitrary feature space $\mathcal{X}$ into one of two classes, labelled $-1$ or
$+1$. A discriminant function $f : \mathcal{X} \to \mathbb{R}$ yields a classifier $\mathrm{sgn}(f(x)) \in \{-1, +1\}$
that represents our guess of the label $Y$ of a future observation $X$, and we
err if the margin $y \cdot f(x) < 0$. Since observations $x$ for which the conditional
probability

$$\eta(x) = P\{Y = +1 \mid X = x\} \tag{1}$$

is close to 1/2 are difficult to classify, we introduce a reject option for classifiers,
by allowing for a third decision, ® (reject), expressing doubt.

We build in the reject option by using a threshold value $0 \le \tau < 1$ as follows.
Given a discriminant function $f : \mathcal{X} \to \mathbb{R}$, we report $\mathrm{sgn}(f(x)) \in \{-1, 1\}$ if
$|f(x)| > \tau$, but we withhold decision if $|f(x)| \le \tau$ and report ®. We assume
that the cost of making a wrong decision is 1 and the cost of utilizing the reject
option is $d > 0$. The appropriate risk function is then

$$E[\ell(Y f(X))] = P\{Y f(X) < -\tau\} + d\, P\{|Y f(X)| \le \tau\} \tag{2}$$

for the discontinuous loss

$$\ell(z) = \begin{cases} 1 & \text{if } z < -\tau, \\ d & \text{if } |z| \le \tau, \\ 0 & \text{otherwise.} \end{cases} \tag{3}$$

Research is supported in part by NSF grant DMS 0706829.

M. Wegkamp / Lasso type classifiers with a reject option

Since we never reject if $d > 1/2$, see [H], we restrict ourselves to the cases
$0 < d \le 1/2$. The generalized Bayes discriminant function, minimizing (2), is
then

$$f_0(x) = \begin{cases} -1 & \text{if } \eta(x) < d, \\ 0 & \text{if } d \le \eta(x) \le 1 - d, \\ +1 & \text{if } \eta(x) > 1 - d, \end{cases} \tag{4}$$

with risk

$$E[\min\{\eta(X),\, 1 - \eta(X),\, d\}],$$

see [9, 13]. The case $(\tau, d) = (0, 1/2)$ reduces to the classical situation without the
reject option. We can view $d$ as an upper bound on the conditional probability
of misclassification (given $X$) that is considered tolerable.
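
The decision rule, the loss (3) and the Bayes rule (4) can be sketched in a few lines of Python. This is only an illustration: the function names and the encoding of the reject decision ® as the integer 0 are assumptions made here, not part of the paper.

```python
import numpy as np

REJECT = 0  # assumed encoding of the reject decision


def classify_with_reject(f_values, tau):
    """Report sgn(f(x)) if |f(x)| > tau, and REJECT if |f(x)| <= tau."""
    f_values = np.asarray(f_values, dtype=float)
    decisions = np.sign(f_values).astype(int)
    decisions[np.abs(f_values) <= tau] = REJECT
    return decisions


def reject_loss(z, tau, d):
    """Discontinuous loss (3) evaluated at the margin z = y * f(x)."""
    z = np.asarray(z, dtype=float)
    return np.where(z < -tau, 1.0, np.where(np.abs(z) <= tau, d, 0.0))


def bayes_discriminant(eta, d):
    """Generalized Bayes rule (4) as a function of eta(x) = P(Y = +1 | X = x)."""
    eta = np.asarray(eta, dtype=float)
    return np.where(eta < d, -1, np.where(eta > 1 - d, 1, 0))


def bayes_risk(eta, d):
    """Sample average of min{eta, 1 - eta, d} over the supplied eta values."""
    eta = np.asarray(eta, dtype=float)
    return float(np.mean(np.minimum(np.minimum(eta, 1 - eta), d)))
```

For instance, with $d = 0.2$ an observation with $\eta(x) = 0.5$ is rejected, while $\eta(x) = 0.95$ is assigned the label $+1$.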
The estimators

$$f_\lambda(x) = \sum_{j=1}^{M} \lambda_j f_j(x)$$

of $f_0(x)$ that we study in this paper are linear combinations of base functions
$f_j$ from a dictionary $F_M = \{f_1, \ldots, f_M\}$. We suggest regularized empirical risk
minimization using convex surrogate loss functions $\phi$ and a penalty term
$p(\lambda) = 2 r_n |\lambda|_1$ that is proportional to the $\ell_1$-norm $|\lambda|_1$ of the parameter $\lambda$. The
regularized empirical risk

$$\frac{1}{n} \sum_{i=1}^{n} \phi(Y_i f_\lambda(X_i)) + p(\lambda) \tag{5}$$

is then convex in $\lambda$ and its minimization can be solved by a (tractable) convex
program.
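
As a minimal sketch of the criterion (5), the following Python function evaluates the penalized empirical risk for a given coefficient vector. The design matrix `F` (with entries $f_j(X_i)$), the function names, and the use of the ordinary hinge loss as a stand-in for the convex surrogate are assumptions made for illustration only.

```python
import numpy as np


def penalized_empirical_risk(lam, F, y, loss, r_n):
    """Evaluate (1/n) sum_i loss(Y_i f_lambda(X_i)) + 2 r_n |lambda|_1.

    F is an n x M array with F[i, j] = f_j(X_i); y holds labels in {-1, +1}.
    """
    lam = np.asarray(lam, dtype=float)
    margins = y * (F @ lam)              # Y_i * f_lambda(X_i)
    penalty = 2.0 * r_n * np.sum(np.abs(lam))
    return float(np.mean(loss(margins)) + penalty)


def hinge(z):
    """Ordinary hinge loss, used here only as a placeholder convex surrogate."""
    return np.maximum(0.0, 1.0 - z)
```

Since both the averaged convex loss and the $\ell_1$ penalty are convex in $\lambda$, any generic convex solver could in principle be applied to this objective.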

The organization of the paper is as follows. Section 2 presents a general
bound on the excess risk of minimizers $\hat\lambda$ of the penalized empirical risk (5).
We define an oracle target $\lambda^*$ that provides an ideal approximation $f_{\lambda^*}$ of $f_0$
with possibly many fewer elements $f_i$ of the dictionary $F_M$, and show under
mild assumptions that this oracle target can be recovered by minimization of
(5), even if $M$ is larger than $n$. We advance the use of a novel type of oracle
inequality, explored in [SI [5], where the aim is to show that the sum of the excess
risk and the penalty term $p(\hat\lambda - \lambda^*)$ achieves the optimal balance between the
excess risk and a regularization term. This allows us to determine that the oracle
can be recovered and gives us information about the $\ell_1$-distance between $\hat\lambda$ and
the oracle vector $\lambda^*$. This extends the work of [4, 5, 6, 7] on lasso-type estimators
in regression and density estimation problems to empirical risk minimization of
the general criterion (5) in the context of classification with a reject option. We
take a different approach than the recent technical report [17]. In particular, we
use the concept of mutual coherence, used in [4, 5, 6, 7], which is weaker than the
corresponding requirement in [17], and give a different, simple proof of the main
oracle inequality. We demonstrate that the choice of the tuning parameter
$r_n$ in the penalty $p(\lambda) = 2 r_n |\lambda|_1$ is crucial. We prove that the oracle inequality
holds on an event where $r_n$ exceeds a certain random quantity $\hat r$. Then we show
that $\hat r$ is highly concentrated around its mean using McDiarmid's concentration
inequality and provide an upper bound for $E[\hat r]$.

Section 3 applies the results of Section 2 to the specific generalized hinge loss
function $\phi_d$ introduced in [1], extending the work [14] to classification with a
reject option. This loss is convex, so that the minimization of (5) is computationally
feasible, and at the same time classification calibrated, as the minimizer
of $E[\phi_d(Y f(X))]$ is the Bayes discriminant $f_0$, our parameter of interest.

Finally, the proofs are collected in Section 4.



2. Oracle inequalities for the excess risk 
2.1. Preliminaries

The data $(X_1, Y_1), \ldots, (X_n, Y_n)$ consist of independent copies of $(X, Y)$, where
$X$ takes values in an arbitrary measurable space $\mathcal{X}$ and $Y \in \{-1, +1\}$. Let
$F_M = \{f_1, \ldots, f_M\}$ be a finite set of functions (dictionary) with $\|f_j\|_\infty \le C_F$,
and consider discriminant functions

$$f_\lambda(x) = \sum_{j=1}^{M} \lambda_j f_j(x).$$

We consider a loss function $\phi : \mathbb{R} \to [0, \infty)$ that is Lipschitz,

$$|\phi(y) - \phi(y')| \le C_\phi |y - y'|$$

with $C_\phi < \infty$, and based on this loss function, we define the risk functions

$$R_\phi(\lambda) = E[\phi(Y f_\lambda(X))] \quad \text{and} \quad \hat R_\phi(\lambda) = \frac{1}{n} \sum_{i=1}^{n} \phi(Y_i f_\lambda(X_i)).$$

We assume that $f_0$ defined in (4) minimizes the risk $E[\phi(Y f(X))]$ over all measurable
$f : \mathcal{X} \to \mathbb{R}$, and we denote its risk by $R_0$, that is,

$$R_0 = \inf_f E[\phi(Y f(X))].$$

We measure the performance of our estimators in terms of the excess risk

$$\Delta_\phi(\lambda) = R_\phi(\lambda) - R_0.$$
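Based on the penalty

$$p(\lambda) = 2 r_n |\lambda|_1 = 2 r_n \sum_{j=1}^{M} |\lambda_j|$$

with $r_n$ specified later in Section 2.4, the penalized empirical risk minimizer $\hat\lambda$
satisfies

$$\hat R_\phi(\hat\lambda) + p(\hat\lambda) \le \hat R_\phi(\lambda) + p(\lambda) \quad \text{for all } \lambda \in \mathbb{R}^M. \tag{6}$$

In particular, (6) ensures that for $\lambda_0 = (0, \ldots, 0)$,

$$p(\hat\lambda) \le \hat R_\phi(\hat\lambda) + p(\hat\lambda) \le \hat R_\phi(\lambda_0) + p(\lambda_0) = \phi(0),$$

which in turn implies $|\hat\lambda|_1 \le \phi(0)/(2 r_n)$. This means that we effectively minimize
the penalized empirical risk $\hat R_\phi(\lambda) + p(\lambda)$ over $\lambda$ in the set

$$\Lambda_n = \{\lambda \in \mathbb{R}^M : |\lambda|_1 \le \phi(0)/(2 r_n)\}.$$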

2.2. Assumptions 

We impose two conditions. Given some finite measure $\mu$ on $\mathcal{X}$, set

$$\langle f, g \rangle = \int f(x) g(x)\, \mu(dx) \quad \text{and} \quad \|f\|^2 = \int f^2(x)\, \mu(dx).$$

The first condition imposes a link between the distance $\|f_\lambda - f_0\|$ and the excess
risk $\Delta_\phi(\lambda)$:

Condition 1. There exist $C_{\Lambda,\phi} < \infty$ and $0 \le \beta \le 1$ such that, for all $\lambda \in \Lambda_n$,

$$\|f_\lambda - f_0\| \le C_{\Lambda,\phi}\, \Delta_\phi^\beta(\lambda). \tag{7}$$

In regression and density estimation problems as considered in [4, 5, 6, 7],
this condition trivially holds with $\beta = 1/2$ and $C_{\Lambda,\phi} = 1$. This relation is more
delicate to establish in classification problems. It depends on the behavior of
the conditional probability $\eta(X)$ near $d$ and $1 - d$, see Section 3 below.

Our goal is to estimate $f_0$ via linear combinations $f_\lambda(x)$ and to evaluate
performance in terms of the excess risk $\Delta_\phi(\lambda)$. For any $I = \{i_1, \ldots, i_m\} \subseteq
\{1, \ldots, M\}$, we define the approximating parameter space

$$\Lambda(I) = \{\lambda \in \mathbb{R}^M : \lambda_i = 0 \text{ for all } i \notin I\}$$

and let $\hat\lambda_I$ minimize $\hat R_\phi(\lambda)$ over $\Lambda(I)$. An oracle that knows $f_0$ would be able
to tell us in advance which approximating space $\Lambda(I)$ yields the smallest excess
risk $\Delta_\phi(\hat\lambda_I)$. However, $f_0$ is unknown so the best we can do is to mimic the
behavior of the oracle. General theory for empirical risk minimization in the
classification context [2, 3, 11] indicates that

A4A/)< inf A0(A) + 
AeA(/) 

where $|I|$ denotes the cardinality of the set $I$ and the symbol $\lesssim$ means that
the inequality holds up to known multiplicative constants. Various choices are
possible for the parameter p depending on the margin exponent $\alpha \ge 0$ defined
in Section 3. Our target of interest, the oracle vector $\lambda^* \in \Lambda_n$, depends on $\beta$.
Formally, we define it as follows:




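Definition. Let $c_F = \min_{1 \le j \le M} \|f_j\|$ and let $\lambda^*$ be the minimizer of

$$3 \Delta_\phi(\lambda) + 2 \left( \frac{8 C_{\Lambda,\phi}}{c_F} \right)^{1/(1-\beta)} \left( r_n^2 |\lambda|_0 \right)^{1/(2-2\beta)} \tag{8}$$

over $\lambda \in \Lambda_n$, where $|\lambda|_0 = \sum_{j=1}^{M} 1\{\lambda_j \ne 0\}$ denotes the number of non-zero coefficients of
the vector $\lambda$.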
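
To make the trade-off in the oracle criterion concrete, here is a small Python sketch. It assumes that (8) has the form $3\Delta_\phi(\lambda) + 2(8 C_{\Lambda,\phi}/c_F)^{1/(1-\beta)} (r_n^2 |\lambda|_0)^{1/(2-2\beta)}$ with $\beta < 1$; the function name and argument names are hypothetical, and the excess risk is passed in as a number because it is not computable from data alone.

```python
import numpy as np


def oracle_criterion(excess_risk, lam, r_n, C_link, c_F, beta):
    """Assumed form of (8): approximation error plus a sparsity-driven term.

    excess_risk : value of the excess risk at lam (supplied by the caller)
    C_link      : constant from Condition 1 (hypothetical argument name)
    c_F         : smallest L2(mu) norm among the dictionary elements
    beta        : exponent from Condition 1, required to be < 1 here
    """
    if not 0.0 <= beta < 1.0:
        raise ValueError("this sketch requires 0 <= beta < 1")
    l0 = np.count_nonzero(lam)                       # |lambda|_0
    const = 2.0 * (8.0 * C_link / c_F) ** (1.0 / (1.0 - beta))
    return 3.0 * excess_risk + const * (r_n ** 2 * l0) ** (1.0 / (2.0 - 2.0 * beta))
```

A sparser vector lowers the second term but typically raises the excess risk, so the minimizer balances the two.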

Thus $\lambda^*$ balances the approximation error, as measured by the excess risk
$\Delta_\phi(\lambda)$, and the complexity of the parameter set $\Lambda(I)$ to which $\lambda^*$ belongs,
as measured by the regularization term $(r_n^2 |\lambda|_0)^{1/(2-2\beta)}$. The constants 3 and
$2(8 C_{\Lambda,\phi}/c_F)^{1/(1-\beta)}$ can be changed: a decrease in the former will lead to an increase
in the latter, and vice versa. The constant $c_F$ can be avoided altogether if we
take the penalty $p(\lambda) = 2 r_n \sum_{i=1}^{M} \|f_i\| |\lambda_i|$, but in practice $\mu$, and consequently
$\|f_i\|$, is unknown. Surely we could plug in estimates for $\|f_i\|$ as in [4, 5, 6, 7],
but we chose to keep the exposition and proofs as simple as possible.

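Let

$$I^* = \{i : \lambda_i^* \ne 0\}$$

be the collection of non-zero coefficients of $\lambda^*$,

$$|\lambda^*|_0 = \sum_{i=1}^{M} 1\{\lambda_i^* \ne 0\}$$

be the cardinality of $I^*$, and

$$\rho(i, j) = \frac{\langle f_i, f_j \rangle}{\|f_i\| \, \|f_j\|}$$

be the correlation between $f_i$ and $f_j$. Our second assumption requires that

$$\rho^* = \max_{i \in I^*} \max_{j \ne i} |\rho(i, j)| \tag{9}$$

is small: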
Condition 2. Let $c_F = \min_{1 \le j \le M} \|f_j\|$ and assume that

$$12 \rho^* |\lambda^*|_0 \le c_F. \tag{10}$$

This mainly states that the submatrix $(\langle f_i, f_j \rangle)_{i,j \in I^*}$ is positive definite
and that the correlations $\rho(i, j)$ between elements $f_i$, $i \in I^*$, of this submatrix
and outside elements $f_j$, $j \notin I^*$, are relatively small. We refer to this assumption
as the local mutual coherence assumption, see [4, 5, 6, 7].
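
A hedged numerical illustration of the local mutual coherence: the sketch below replaces the measure $\mu$ by the empirical distribution of a sample, which is an assumption made here (in the paper $\mu$ is a fixed, typically unknown, measure). The matrix `F` with entries $f_j(X_i)$, the function names, and the index list `support` standing in for $I^*$ are all hypothetical.

```python
import numpy as np


def local_mutual_coherence(F, support):
    """Return (rho_star, c_F) with mu taken to be the empirical measure.

    F       : n x M array with F[i, j] = f_j(X_i)
    support : non-empty list of indices playing the role of I*
    """
    F = np.asarray(F, dtype=float)
    gram = F.T @ F / F.shape[0]              # <f_i, f_j>
    norms = np.sqrt(np.diag(gram))           # ||f_j||
    corr = gram / np.outer(norms, norms)     # rho(i, j)
    np.fill_diagonal(corr, 0.0)              # exclude the case j = i
    rho_star = float(np.max(np.abs(corr[list(support), :])))
    return rho_star, float(norms.min())


def condition_2_holds(F, support):
    """Check 12 * rho_star * |I*| <= c_F for the given support."""
    rho_star, c_F = local_mutual_coherence(F, support)
    return bool(12.0 * rho_star * len(support) <= c_F)
```

For an orthogonal dictionary $\rho^* = 0$ and the condition holds trivially; two identical dictionary elements give $\rho^* = 1$ and the condition fails.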



2.3. Oracle inequality 

Instrumental in our argument is the random quantity

$$\hat r = \sup_{\lambda \in \Lambda_n} \frac{(R_\phi - \hat R_\phi)(\lambda) - (R_\phi - \hat R_\phi)(\lambda^*)}{|\lambda - \lambda^*|_1 + \varepsilon_n}, \tag{11}$$

where we take $\varepsilon_n = \phi(0)/(n r_n)$.

Our first result states the oracle inequality. It holds true as long as the tuning
parameter $r_n$ in the penalty term exceeds $\hat r$.

Theorem 1. Assume that (7) and (10) hold. On the event $r_n \ge \hat r$,

A4A) + r„|A-A*|i < 3A4A*) + 2f^^V"'(r2|A*|o)^ + ^. 

(12) 

The next section discusses choices of the tuning parameter $r_n$ that ensure
that the probability of the event $\{r_n \ge \hat r\}$ is large.



2.4. Choice of the tuning parameter $r_n$

The next lemma states that $\hat r$ is sharply concentrated around its mean.

Lemma 2. Let $C_F = \max_{1 \le j \le M} \|f_j\|_\infty$. We have

$$0 \le \hat r \le 2 C_\phi C_F \tag{13}$$

and, for all $\delta > 0$,

$$P\{\hat r - E[\hat r] \ge \delta\} \le \exp\left( -\frac{n \delta^2}{2 C_\phi^2 C_F^2} \right). \tag{14}$$

Proof. The first assertion follows directly from the definition of $\hat r$. The second
statement follows from an application of McDiarmid's bounded differences inequality
[10, Theorem 2.2, page 8] after observing that a change of a single pair
$(X_i, Y_i)$ changes $\hat r$ by at most $2 C_\phi C_F / n$. □
The range of $\hat r$ in (13) is important for implementation of the method: we
suggest finding a good value for $r_n$ by cross validation, and the grid can
be taken on the interval $[0, 2 C_\phi C_F]$. Inequality (14) is important for theoretical
considerations. It shows that we should take

$$r_n = E[\hat r] + \sqrt{\frac{2 \log(1/\delta)}{n}}\, C_\phi C_F \tag{15}$$

for some $0 < \delta < 1$, since then

$$P\{r_n \ge \hat r\} \ge 1 - \delta.$$
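
The two roles of the concentration bound can be sketched numerically. The snippet below assumes the bounded differences constant $2 C_\phi C_F / n$ stated in the proof of Lemma 2, so that McDiarmid's inequality gives the tail $\exp(-n t^2 / (2 C_\phi^2 C_F^2))$; the function names are hypothetical, and $E[\hat r]$ itself is not computed.

```python
import math

import numpy as np


def deviation_term(n, delta, C_phi, C_F):
    """Margin added to E[r_hat] so that the tail probability is at most delta."""
    return C_phi * C_F * math.sqrt(2.0 * math.log(1.0 / delta) / n)


def mcdiarmid_tail(t, n, C_phi, C_F):
    """Upper bound on P{r_hat - E[r_hat] >= t} under the stated assumption."""
    return math.exp(-n * t ** 2 / (2.0 * C_phi ** 2 * C_F ** 2))


def tuning_grid(C_phi, C_F, num=20):
    """Candidate values of r_n for cross validation on [0, 2 C_phi C_F]."""
    return np.linspace(0.0, 2.0 * C_phi * C_F, num)
```

Plugging `deviation_term(n, delta, ...)` into `mcdiarmid_tail` returns `delta`, which is exactly how the margin in the choice of $r_n$ is calibrated.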

The expected value $E[\hat r]$ is of order $\{\log(M \vee n)/n\}^{1/2}$ by the following lemma.

Lemma 3. Let $J_n$ be the smallest integer such that $2^{J_n} \ge n$. Then, for all
$M, n \ge 1$ and $0 < \delta < 1$,

E[f] < I^1^^21og2(MVn)+ ^"^^^^ 



01 " ' ' 2(MVn)2' 

Consequently, 

Corollary 4. Assume that (7) and (10) hold, and take



^ ^C4>Cf ^ , JnC^^Cp , ^ ^ / 21og(l/^) 
rn> —V2\og2{MVn) + ^^j^-—^ + Cc^CF\l z ■ (1^) 



Then the oracle inequality (12) holds with probability at least $1 - \delta$.



3. Example: generalized hinge loss 

Throughout this section, we consider a fixed cost $d$ and a fixed threshold value
$\tau$ with $0 < d \le 1/2$ and $d \le \tau \le 1 - d$. Instead of the discontinuous loss $\ell(z)$
defined in (3), [1] considers the convex surrogate loss

$$\phi_d(z) = \begin{cases} 1 - a z & \text{if } z < 0, \\ 1 - z & \text{if } 0 \le z < 1, \\ 0 & \text{otherwise,} \end{cases} \tag{17}$$

where $a = (1 - d)/d \ge 1$, and shows that the Bayes discriminant function $f_0$
defined in (4) minimizes both the risks $E[\ell(Y f(X))]$ and $E[\phi_d(Y f(X))]$ over
all measurable $f : \mathcal{X} \to \mathbb{R}$. We see that $\phi_d(z) \ge \ell(z)$ for all $z \in \mathbb{R}$ as long as
$0 \le \tau \le 1 - d$. Moreover, [1] shows that a relation like this holds not only for the
loss functions and hence the risks, but for the excess risks as well. In particular,
for all $d \le \tau \le 1 - d$, we have

$$E[\ell(Y f(X))] - E[\ell(Y f_0(X))] \le E[\phi_d(Y f(X))] - E[\phi_d(Y f_0(X))]. \tag{18}$$
This is important since minimization of (5) produces oracle inequalities in terms
of the $\phi_d$-excess risk (Theorem 1), not in terms of the original excess risk directly.
The latter risk has a sound statistical interpretation.
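
The pointwise domination of the discontinuous loss by the generalized hinge loss is easy to check numerically. The following Python sketch implements both losses as piecewise functions of the margin $z$; the function names and the grid used in the check are illustrative assumptions.

```python
import numpy as np


def generalized_hinge(z, d):
    """Generalized hinge loss: slope -(1 - d)/d for z < 0, slope -1 on [0, 1), else 0."""
    z = np.asarray(z, dtype=float)
    a = (1.0 - d) / d
    return np.where(z < 0, 1.0 - a * z, np.where(z < 1, 1.0 - z, 0.0))


def reject_loss(z, tau, d):
    """Discontinuous loss: 1 if z < -tau, d if |z| <= tau, 0 otherwise."""
    z = np.asarray(z, dtype=float)
    return np.where(z < -tau, 1.0, np.where(np.abs(z) <= tau, d, 0.0))


def dominates_on_grid(d, tau, lo=-3.0, hi=3.0, num=601):
    """Check generalized_hinge >= reject_loss on a grid (small float tolerance)."""
    z = np.linspace(lo, hi, num)
    return bool(np.all(generalized_hinge(z, d) >= reject_loss(z, tau, d) - 1e-12))
```

With $d = 0.25$ the check passes for any $\tau$ up to $1 - d = 0.75$ and fails for a larger threshold such as $\tau = 0.9$, where the surrogate drops below $d$ inside the reject region.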

For plug-in rules and empirical risk minimizers, [BEI] show that for classification
with a reject option, fast rates (faster than $n^{-1/2}$) for the excess risk may
be obtained if the probability that $\eta(X)$, defined in (1), is close to the critical
values of $d$ and $1 - d$, is small. More precisely, assume that there exist $A \ge 1$
and $\alpha \ge 0$ such that for all $t > 0$,

$$P\{|\eta(X) - d| \le t\} \le A t^\alpha \quad \text{and} \quad P\{|\eta(X) - (1 - d)| \le t\} \le A t^\alpha. \tag{19}$$

For $d = 1/2$, this assumption is equivalent to Tsybakov's margin condition [15].
Then, [1, Proof of Lemma 7] shows that

^ (,),{EK(f.(X),/o(X))]}^ (20) 
- 2d{4A(l + |A|iCF)}" 

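where

$$\rho_\eta(f, f_0) = \begin{cases} \eta\, |f - f_0| & \text{if } \eta < d \text{ and } f < -1, \\ (1 - \eta)\, |f - f_0| & \text{if } \eta > 1 - d \text{ and } f > 1, \\ |f - f_0| & \text{otherwise.} \end{cases}$$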

Following [13], we consider the measure $\mu$ defined by

$$\mu(B) = \int_B \eta(x) \{1 - \eta(x)\}\, P(dx), \tag{21}$$

for any Borel set $B$, where $P$ is the probability measure of $X$. Since

$$\int \{f_\lambda(x) - f_0(x)\}^2\, \mu(dx) \le (1 + |\lambda|_1 C_F) \int |f_\lambda(x) - f_0(x)|\, \mu(dx),$$

it follows from (20) that condition (7) holds for all $\lambda$ with $|\lambda|_1 \le C_\Lambda$ with

Ca,^ = (l + CACi.)^(2d)5W{4A(l + CACj.)}^, (22) 

and $\beta = \alpha/(2 + 2\alpha)$.
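
To see how the margin exponent $\alpha$ feeds into the oracle criterion, the short Python sketch below converts $\alpha$ into $\beta = \alpha/(2 + 2\alpha)$ and into the two exponents $1/(1-\beta)$ and $1/(2-2\beta)$ that appear in the oracle criterion of Section 2.2. The function names are hypothetical.

```python
def link_exponent(alpha):
    """beta = alpha / (2 + 2 alpha) for a margin exponent alpha >= 0."""
    return alpha / (2.0 + 2.0 * alpha)


def oracle_exponents(alpha):
    """Return (1/(1 - beta), 1/(2 - 2 beta)) for beta = link_exponent(alpha).

    Algebraically these equal (2 + 2 alpha)/(2 + alpha) and (1 + alpha)/(2 + alpha).
    """
    beta = link_exponent(alpha)
    return 1.0 / (1.0 - beta), 1.0 / (2.0 - 2.0 * beta)
```

For $\alpha = 0$ the exponent on $r_n^2 |\lambda|_0$ is $1/2$, and it increases towards 1 as $\alpha$ grows.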

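Let $\hat\lambda$ minimize the penalized empirical risk $\hat R_{\phi_d}(\lambda) + p(\lambda)$ over the restricted
set

$$\Lambda = \{\lambda \in \mathbb{R}^M : |\lambda|_1 \le C_\Lambda\}$$

for some finite $C_\Lambda$ and let $\lambda^*$ minimize

$$3 \Delta_{\phi_d}(\lambda) + 2 \left( \frac{8 C_{\Lambda,\phi}}{c_F} \right)^{\frac{2 + 2\alpha}{2 + \alpha}} \left( r_n^2 |\lambda|_0 \right)^{\frac{1 + \alpha}{2 + \alpha}} \tag{23}$$

over $\lambda \in \Lambda$. Provided then that the mutual coherence assumption (10) holds,
Corollary 4 states that for all choices $r_n = r_n(\delta)$ in (16) with $C_\phi = (1 - d)/d$,
the oracle inequality (12) holds with $\phi = \phi_d$, $\beta = \alpha/(2 + 2\alpha)$ and $\lambda^*$ as in (23),
with probability at least $1 - \delta$, where $0 < \delta < 1$ is given in (16). Consequently,
via (18),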

Theorem 5. Assume that (19) holds for some $\alpha \ge 0$ and that the dictionary
$F_M$ satisfies (10) with $\mu$ defined in (21). Let $\lambda^* \in \Lambda$ be as given in (23). Then
the minimizer $\hat\lambda \in \Lambda$ with $r_n$ as in (16) with $\delta = 1/(n \vee M)$ and $C_\phi = (1 - d)/d$
satisfies, for $C_{\Lambda,\phi}$ defined in (22),



with probability tending to 1 as $n \to \infty$.

The best possible "rate" $(r_n^2 |\lambda^*|_0)^{(1+\alpha)/(2+\alpha)}$ is achieved at $\alpha = +\infty$. The
slowest possible rate is achieved at $\alpha = 0$, in which case (19) imposes no restriction
at all on $\eta(X)$.

4. Proofs 

4.1. Proof of Theorem 1

Lemma 6. On the event $\hat r \le r_n$, we have




E[e{Yf^{X))] - E[£{Yfo{X))] + r„|A - A*|i < 




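$$\Delta_\phi(\hat\lambda) - \Delta_\phi(\lambda^*) + r_n |\hat\lambda - \lambda^*|_1 \le 4 r_n \sum_{i \in I^*} |\hat\lambda_i - \lambda_i^*| + r_n \varepsilon_n. \tag{24}$$
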

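Proof. Rewrite (6) to obtain, for $G(\lambda) = \hat R_\phi(\lambda) - R_\phi(\lambda)$,

$$R_\phi(\hat\lambda) - R_\phi(\lambda^*) \le G(\lambda^*) - G(\hat\lambda) + p(\lambda^*) - p(\hat\lambda) \le \hat r\, |\hat\lambda - \lambda^*|_1 + \varepsilon_n \hat r + p(\lambda^*) - p(\hat\lambda).$$

On the event $r_n \ge \hat r$, then,

$$\Delta_\phi(\hat\lambda) - \Delta_\phi(\lambda^*) \le r_n |\hat\lambda - \lambda^*|_1 + \varepsilon_n r_n + p(\lambda^*) - p(\hat\lambda).$$

Add $r_n |\hat\lambda - \lambda^*|_1$ to both sides, and deduce

$$\begin{aligned}
\Delta_\phi(\hat\lambda) - \Delta_\phi(\lambda^*) + r_n |\hat\lambda - \lambda^*|_1
&\le 2 r_n |\hat\lambda - \lambda^*|_1 + r_n \varepsilon_n + 2 r_n |\lambda^*|_1 - 2 r_n |\hat\lambda|_1 \\
&= 2 r_n \sum_{i=1}^{M} |\hat\lambda_i - \lambda_i^*| + 2 r_n \sum_{i \in I^*} |\lambda_i^*| - 2 r_n \sum_{i=1}^{M} |\hat\lambda_i| + r_n \varepsilon_n \\
&\le 4 r_n \sum_{i \in I^*} |\hat\lambda_i - \lambda_i^*| + r_n \varepsilon_n,
\end{aligned}$$

where the last step uses $|\lambda_i^*| - |\hat\lambda_i| \le |\hat\lambda_i - \lambda_i^*|$ for $i \in I^*$ and $\lambda_i^* = 0$ for $i \notin I^*$,
which proves our claim. □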
Lemma 7. 

c, E 1^' - ^ VIA - A*|i + |A*|;/'||f^_^J| (25) 
is/* 

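Proof. See the proof of Theorem 2 of [7, pages 536, 537]. For completeness, we
repeat the argument: Set

$$u_j = \hat\lambda_j - \lambda_j^*, \qquad U = \sum_{j=1}^{M} |u_j|\, \|f_j\|, \qquad U^* = \sum_{j \in I^*} |u_j|\, \|f_j\|.$$

Clearly

$$\sum_{i \notin I^*} \sum_{j \notin I^*} \langle f_i, f_j \rangle u_i u_j \ge 0$$

and so we obtain

$$\begin{aligned}
\sum_{j \in I^*} u_j^2 \|f_j\|^2
&= \|f_{\hat\lambda} - f_{\lambda^*}\|^2 - \sum_{i \notin I^*} \sum_{j \notin I^*} \langle f_i, f_j \rangle u_i u_j - 2 \sum_{i \notin I^*} \sum_{j \in I^*} \langle f_i, f_j \rangle u_i u_j - \sum_{i, j \in I^*,\, i \ne j} \langle f_i, f_j \rangle u_i u_j \\
&\le \|f_{\hat\lambda} - f_{\lambda^*}\|^2 + 2 \rho^* \sum_{i \in I^*} |u_i|\, \|f_i\| \sum_{j=1}^{M} |u_j|\, \|f_j\| - \rho^* \Big( \sum_{i \in I^*} |u_i|\, \|f_i\| \Big)^2 \\
&= \|f_{\hat\lambda} - f_{\lambda^*}\|^2 + 2 \rho^* U^* U - \rho^* (U^*)^2.
\end{aligned} \tag{26}$$

The left-hand side can be bounded by $\sum_{j \in I^*} u_j^2 \|f_j\|^2 \ge (U^*)^2 / |\lambda^*|_0$ using the
Cauchy-Schwarz inequality, and we obtain that

$$(U^*)^2 \le \|f_{\hat\lambda} - f_{\lambda^*}\|^2\, |\lambda^*|_0 + 2 \rho^* |\lambda^*|_0\, U^* U$$

and, using the properties of a function of degree two in $U^*$, we further obtain

$$U^* \le 2 \rho^* |\lambda^*|_0\, U + \sqrt{|\lambda^*|_0}\, \|f_{\hat\lambda} - f_{\lambda^*}\| \tag{27}$$

and the result follows from $c_F \sum_{i \in I^*} |\hat\lambda_i - \lambda_i^*| \le U^*$. □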
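
The "function of degree two" step in the proof above can be spelled out as follows. The shorthand $x$, $b$, $c$ is introduced here only for illustration and does not appear in the original argument.

```latex
% Shorthand (assumed, for illustration only):
%   x = U^*,  b = 2\rho^* |\lambda^*|_0 U,  c = |\lambda^*|_0 \|f_{\hat\lambda} - f_{\lambda^*}\|^2.
% The quadratic inequality x^2 \le b x + c with x, b, c \ge 0 forces x to lie
% below the larger root of t^2 - b t - c = 0:
\[
  x \;\le\; \frac{b + \sqrt{b^2 + 4c}}{2}
    \;\le\; \frac{b + b + 2\sqrt{c}}{2}
    \;=\; b + \sqrt{c},
\]
% where the middle step uses \sqrt{b^2 + 4c} \le b + 2\sqrt{c}.
% Substituting back gives
\[
  U^* \;\le\; 2\rho^* |\lambda^*|_0\, U + \sqrt{|\lambda^*|_0}\, \|f_{\hat\lambda} - f_{\lambda^*}\|.
\]
```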

Combining both lemmas with the mutual coherence assumption immediately 
gives 

Lemma 8. On the event $r_n \ge \hat r$,

A^(A)- A^(A*) + ir„|A- A*|i < lr„|A*|y'||f^_^J| + r„e„ (28) 
Finally we use the link between the $L_2(\mu)$ norm of $f_\lambda - f_0$ and the excess risk
$\Delta_\phi(\lambda)$ and Young's inequality, which states

$$ab \le \frac{a^p}{p} + \frac{b^q}{q}, \qquad p > 1, \quad q = \frac{p}{p - 1},$$

so that

$$ab \le \delta\, \frac{a^p}{p} + \frac{(p - 1)\, b^{p/(p-1)}}{p\, \delta^{1/(p-1)}}$$

for all $a, b, \delta > 0$. From Lemma 8 above and condition (7), on the event $r_n \ge \hat r$,

A^(A) - A^(A*) + ir„|A - A*|i < l^r„|A*|J/^{A^(A) + A^(A*)} + r„e„ 

Now use the above Young's inequality twice with $p = 1/\beta$, $\delta = 1/2$,
$b = 4 r_n |\lambda^*|_0^{1/2} C_{\Lambda,\phi}/c_F$, and $a = \Delta_\phi^\beta(\hat\lambda)$ and $a = \Delta_\phi^\beta(\lambda^*)$, respectively, to deduce

A^A)- A4A*) + ir„|A-A*|i 



< 



< 



^ {a,(A) + A,(A*)} + (1 - f3)\rlXX^-^^ ' 



This concludes the proof of Theorem 1. □
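
The weighted form of Young's inequality used in this proof can be checked numerically. The sketch below assumes the form $ab \le \delta a^p/p + (p-1) b^{p/(p-1)} / (p\, \delta^{1/(p-1)})$, which follows from the classical inequality by rescaling $a$ and $b$; the function names are hypothetical.

```python
import itertools


def young_rhs(a, b, p, delta):
    """Right-hand side of the weighted Young inequality (requires p > 1, delta > 0)."""
    q = p / (p - 1.0)
    return delta * a ** p / p + (p - 1.0) * b ** q / (p * delta ** (1.0 / (p - 1.0)))


def young_holds(values, p, delta, tol=1e-12):
    """Check a * b <= young_rhs(a, b, p, delta) for all pairs from `values`."""
    return all(a * b <= young_rhs(a, b, p, delta) + tol
               for a, b in itertools.product(values, repeat=2))
```

In the proof the inequality is applied with $p = 1/\beta$ and $\delta = 1/2$; for example, $\beta = 1/4$ corresponds to `young_holds(values, p=4.0, delta=0.5)`.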



4.2. Proof of Lemma 3

Let $\sigma_1, \ldots, \sigma_n$ be independent Rademacher variables, taking the values ±1, each
with probability 1/2, independent of the data $(X_1, Y_1), \ldots, (X_n, Y_n)$. Set

1 " 



A standard symmetrization trick ([10, page 18]) shows that
|G"(A)-GO(A*)|" 



,[?] < E 



AeA„ |A-A*|i+e„ 



< E 



\G"{X)-G°iX*)\ 



■E 



sup 

,<|A-A*|i<0(O)/r„ 



|G"(A)-G°(A*)| 
|A-A*|i + e„ 
as $|\lambda - \lambda^*|_1 \le \phi(0)/r_n$ for all $\lambda \in \Lambda_n$. The first term can be bounded as follows:



(/) < — E 

< 

en 



sup G"(A)-G°(A*) 

|A-A'|i<e„ I 



sup 

|A-A*|i<e„ 



n 

- Va,y,fA-A.(x^) 

n ^ — ' 



by the contraction principle for Rademacher processes, see [TU pages 112 - 113]. 
This implies that 



(/) < ^E 



sup I A — A* 1 1 max 

|A-A-|i<e„ i<j<M 



1 " 



max 

l<j<M 



v/21og(2M) 



where we used [10, Lemma 2.2, page 7] to get the last inequality. We can apply
this result since

$$E\left[ \exp\left( s \sum_{i=1}^{n} \sigma_i Y_i f_j(X_i) \right) \right] \le \exp(n s^2 C_F^2 / 2)$$

for all $s$, which follows in turn from [10, Lemma 2.1, page 5].

The second term (II) requires a peeling argument [11, page 70]. Since $0 \le
\hat r \le 2 C_\phi C_F$ almost surely, we can use the bound

$$E[II] \le C + 2 C_\phi C_F\, P\{(II) \ge C\}. \tag{29}$$

Observe that for any $C > 0$, and for $J_n$ the smallest integer with $2^{J_n} \varepsilon_n \ge \phi(0)/r_n$
or $2^{J_n} \ge n$,



sup 



|GO(A)-G0(A*)| 



< 



< 



,<|A-A'|i<0(O)/r„ |A — A*|i+£ 

|GO(A)-G°(A 



sup 

2J-ie„<|A-A*|i<23e„ 



sup 

2J-ie„<|A-A* |i<236„ 



|A-A*|i+e„ 



G°(A)-G°(A*) 



Now, set 



Zj= sup G°(A)-G°(A*) 

A-A*li<23e„ 
and the same considerations leading to the final bound of (I) above yield



V21og(2M) 



and 



sup 

2J-ie„<|A-A'|i<2Je„ 



G°(A)-G"(A*) >2^-ie„C 



< ^P{Z, -E[Z,] >2J-i£„C-E[^j]}. 

A change of a single pair $(X_i, Y_i)$ changes $Z_j$ by at most $2 C_\phi C_F (2^j \varepsilon_n)/n$, so
that another application of the bounded differences inequality [10, Theorem 2.2,
page 8] gives, by taking



C = GC^Cf 



^21og2(M Vn) 



the final bound 



Jr. 



J 2(C^CF2%„)221og(2AfV2n) 



^2 log(2Af V 2r 



(QCf2^e„)2 
J„ exp {-2 log(2Af V 2n)} 
J„i2M\/2n)-^. 



Invoke (29) to conclude the proof of Lemma 3.



□ 